Cross-linguistic patterns of speech prosodic differences in autism: A machine learning study

Differences in speech prosody are a widely observed feature of Autism Spectrum Disorder (ASD). However, it is unclear how prosodic differences in ASD manifest across different languages that demonstrate cross-linguistic variability in prosody. Using a supervised machine-learning analytic approach, we examined acoustic features relevant to rhythmic and intonational aspects of prosody derived from narrative samples elicited in English and Cantonese, two typologically and prosodically distinct languages. Our models revealed successful classification of ASD diagnosis using rhythm-relevant features within and across both languages. Classification with intonation-relevant features was significant for English but not Cantonese. Results highlight differences in rhythm as a key prosodic feature impacted in ASD, and also demonstrate important variability in other prosodic properties, such as intonation, that appear to be modulated by language-specific differences.


Acoustic feature extraction
For each of the 20 utterance samples from every participant's narrative data, the following acoustic features were extracted: 1. Rhythm: Three types of acoustic features were derived from all utterances to comprehensively capture aspects of speech rhythm.
i. The speech envelope spectrum (ENV) represents temporal regularities correlating with rhythmic properties of the signal [4,5]. For each utterance, the vocalic energy amplitude envelope was first derived. To derive the envelope, the raw time series of the utterance was first chunked into consecutive bins of one second. Following Tilsen and Arvaniti [5], the time series of each chunk was filtered with a passband of 400-4000 Hz to de-emphasize non-vocalic energy such as glottal energy (including the f0) and obstruent noise. The bandpass-filtered signal was then low-pass filtered with a cutoff of 10 Hz to represent the envelope. The frequency decomposition of the envelope was then computed as follows. The envelope was first downsampled by a factor of 100 and windowed using a Tukey window (r = 0.1) to aid further spectral analyses. It was then normalized by subtracting the mean and rescaling to minimum and maximum values of -1 and 1, respectively. The normalized envelope was zero-padded to a 2048-sample window and submitted to a fast Fourier transform. The spectra across all one-second chunks of each utterance were then averaged to form the envelope spectrum of the utterance, each consisting of 1660 values. A sketch of this procedure is given below.
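For illustration, the following is a minimal MATLAB sketch of this envelope-spectrum computation. Variable names (utterance, fs) are placeholders, rectification via abs() before low-pass filtering is assumed (a standard envelope-extraction step not stated explicitly above), and the selection of the 1660 retained spectral values is not reproduced here:

chunkLen = fs;                                   % 1-s chunks (fs = sampling rate)
nChunks  = floor(numel(utterance) / chunkLen);
spectra  = zeros(nChunks, 2048);
for k = 1:nChunks
    chunk = utterance((k-1)*chunkLen + (1:chunkLen));
    x   = bandpass(chunk, [400 4000], fs);       % de-emphasize non-vocalic energy
    env = lowpass(abs(x), 10, fs);               % amplitude envelope (< 10 Hz)
    env = downsample(env(:), 100);               % downsample by a factor of 100
    env = env .* tukeywin(numel(env), 0.1);      % Tukey window, r = 0.1
    env = env - mean(env);                       % subtract the mean
    env = 2*(env - min(env))/(max(env) - min(env)) - 1;  % rescale to [-1, 1]
    spectra(k, :) = abs(fft(env, 2048));         % zero-padded FFT
end
envSpectrum = mean(spectra, 1);                  % average across 1-s chunks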
ii. The intrinsic mode functions (IMFs) were further computed from the time-varying speech envelope (as described above) using empirical mode decomposition (EMD); IMFs represent syllabic- and supra-syllabic-level fluctuations relevant to speech rhythm [6]. The frequency decompositions of IMF1 and IMF2 (i.e., the power spectral density between 1 and 10 Hz, averaged across the frequency decompositions of all IMF1s and IMF2s from all one-second envelope chunks of each utterance) were included as features, each consisting of 1660 values. A sketch of this step follows.
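A minimal MATLAB sketch of the IMF features, assuming envChunks is a cell array holding the one-second envelope chunks derived above and fsEnv is the envelope sampling rate (both hypothetical names); the exact 1660-value feature layout is not reproduced:

nfft = 2048;
psd1 = []; psd2 = [];
for k = 1:numel(envChunks)
    imf = emd(envChunks{k});                     % columns are IMF1, IMF2, ...
    if size(imf, 2) < 2, continue, end           % skip chunks with < 2 IMFs
    [p1, f] = periodogram(imf(:, 1), [], nfft, fsEnv);
    p2      = periodogram(imf(:, 2), [], nfft, fsEnv);
    psd1 = [psd1, p1];  psd2 = [psd2, p2];       % collect per-chunk spectra
end
band = f >= 1 & f <= 10;                         % 1-10 Hz fluctuations
imf1Feature = mean(psd1(band, :), 2);            % average across all chunks
imf2Feature = mean(psd2(band, :), 2);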
iii. The temporal modulation spectrum (TMS) is the frequency decomposition of the temporal envelope of a signal, reflecting how fast sound intensity fluctuates over time [7]. Temporal modulation at lower frequencies (< 32 Hz) is a primary acoustic correlate of perceived rhythm in speech [8,9] and contributes to speech intelligibility [10]. For each utterance, the raw time series was first chunked into consecutive bins of one second. The TMS of each one-second bin was then computed using the procedure and MATLAB script from Ding and colleagues [7]. In this procedure, the sound signal in each bin was first decomposed into narrow frequency bands using a cochlear model, and the temporal envelope was extracted from each band. The extracted envelopes were rescaled using a logarithmic function and then converted into the frequency domain to yield the modulation spectrum of each bin.
2. Intonation: The fundamental frequency (f0) contour of each utterance was derived to represent its intonation. For each utterance, a raw f0 contour was first derived using the pitch function of the Audio Toolbox in MATLAB [9]. f0 values of the contour were estimated using a Normalized Correlation Function algorithm [11] with 52-ms analysis windows that overlapped adjacent windows by 42 ms. To minimize pitch-tracking errors, pitch-tracking ranges dependent on speaker age and gender (see Supplementary Table 1) were implemented in the algorithm [2]. Given that the duration of each utterance varied, a time-normalization procedure [12,13] was further performed to obtain an f0 contour that was uniform in size across all utterances. The procedure took 20 f0 values from the raw f0 contour at equal proportional intervals; these 20 values were then concatenated to form the time-normalized f0 contour (see the first sketch at the end of this section).
Statistical significance of each set of classifications was assessed using a permutation approach. A null distribution of AUC values was computed by repeating the same cross-validation procedure 5001 times, with the diagnosis labels of participants randomized each time. The percentage of AUC values from the permuted model that were equal to or higher than the median AUC of the actual classification was taken as the p-value [16]. To adjust for multiple comparisons across the two sets of classifications (using features relevant to intonation and rhythm, respectively), each p-value was adjusted using a Bonferroni correction.
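A minimal MATLAB sketch of the f0 extraction and time normalization described under 2. Intonation above; x and fs are placeholders, and the pitch range shown is an arbitrary example rather than a value from Supplementary Table 1:

f0raw = pitch(x, fs, ...
    'Method', 'NCF', ...                    % Normalized Correlation Function
    'WindowLength', round(0.052 * fs), ...  % 52-ms analysis windows
    'OverlapLength', round(0.042 * fs), ... % 42-ms overlap
    'Range', [75 500]);                     % speaker-dependent range (placeholder)
idx = round(linspace(1, numel(f0raw), 20)); % 20 equal proportional intervals
f0norm = f0raw(idx);                        % time-normalized f0 contour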
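The permutation test can be sketched as follows; runCV is a hypothetical wrapper around the nested 10-fold CV SVM that returns a median AUC given a feature array X and a label vector:

nPerm = 5001;
aucNull = zeros(nPerm, 1);
for i = 1:nPerm
    shuffled = labels(randperm(numel(labels)));  % randomize diagnosis labels
    aucNull(i) = runCV(X, shuffled);             % AUC under the null
end
aucObs = runCV(X, labels);                       % median AUC, actual labels
p = mean(aucNull >= aucObs);                     % proportion >= observed AUC
pAdj = min(1, p * 2);                            % Bonferroni, 2 feature sets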
Post-hoc analysis: ruling out gender and age effects on f0 features

English ML classification
Results of Model 1 suggest that f0 features can be used to effectively classify ASD and TD diagnostic categories in English samples but not in Cantonese samples. Because of uneven gender ratios across groups (see Table 1 of the main article), fundamental gender-related differences in pitch level could have been a confounding factor in the f0-based ASD vs. TD classifications (although potential gender-related f0 differences did not drive successful classification in the Cantonese samples using f0 features, likely owing to our subject resampling procedure and to the smaller sample of adults, in whom larger gender differences would be expected). Further, the significant age difference between the English ASD and TD groups (see Table 1 of the main article) may have increased the discriminability of pitch across the two groups, compared with the Cantonese ASD and TD groups, which did not differ significantly in age. To rule out gender and age as potential confounding factors in the f0-based classification in English, a post-hoc analysis employing a resampling procedure with age-matched males and females was performed.

Methods: Matching gender ratios and age with resampling
In the post-hoc analysis, 5001 iterations of SVM classification using f0 features were performed according to the nested 10-fold CV procedures described in the ML analysis pipeline. However, before each iteration of classification, the f0 feature array was resampled such that the male-to-female ratios of samples in the ASD and TD groups were the same. This resampling involved randomly selecting samples from 4 of the 18 females with TD to match those of the 4 females with ASD; conversely, samples from 15 of the 29 males with ASD were randomly selected to match those from the 15 males with TD. The ages of all participants in the two resampled groups were then submitted to a two-sample t-test. Within an iteration, the random resampling was repeated until the age difference between the resampled ASD and TD groups was not significant, based on a stringent criterion (p > .500) of the t-test. As a result, the feature array in each iteration consisted of f0 features from 38 subjects, with the age-matched ASD and TD groups each consisting of 15 males and 4 females (a sketch of one resampling iteration is given below). The permutation procedure with 5001 iterations was also performed on the resampled feature array in each iteration of resampling. The p-value of the classification was taken as the percentage of AUC values from the permuted model that were equal to or higher than the median AUC of the actual classification.
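A minimal MATLAB sketch of one resampling iteration; the index vectors (idxFemaleTD, idxMaleASD, idxMaleTD, idxFemaleASD) and age are hypothetical names for indices into the feature array and the vector of participant ages:

matched = false;
while ~matched
    fTD  = randsample(idxFemaleTD, 4);    % 4 of the 18 females with TD
    mASD = randsample(idxMaleASD, 15);    % 15 of the 29 males with ASD
    grpTD  = [fTD;  idxMaleTD];           % TD: 15 males + 4 females
    grpASD = [mASD; idxFemaleASD];        % ASD: 15 males + 4 females
    [~, p] = ttest2(age(grpTD), age(grpASD));
    matched = p > .500;                   % stringent age-matching criterion
end
% the 38 retained subjects are then submitted to the nested 10-fold CV SVM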

Results: Gender-and-age-matched SVM Classification using f0 features
The AUC values of all SVM classifications in this post-hoc analysis are presented in Supplementary Figure 1 (left). With gender- and age-matched samples obtained through resampling, the SVM classification using f0 features was significant (p = 0.0028), achieving a median AUC of 0.833 (corresponding to an accuracy of 0.775, sensitivity of 0.650, and specificity of 0.944), which was comparable with, if not qualitatively higher than, the AUC of Model 1, in which gender and age were not matched. This result suggests that neither gender nor age was likely to be a confounding factor in the ASD/TD classification using f0 features derived from English narrative samples.

Supplementary ML analysis: demonstration of cross-linguistic acoustic variability using an ML approach
We assumed that both intonational and rhythmic properties vary systematically across our English and Cantonese utterance samples. To test the assumption that such systematic variability was represented in our speech samples of these two languages, a supplementary analysis was performed. In this analysis, an ML model was trained to classify the two languages using the intonation and rhythm features derived from individuals in the English TD and Cantonese TD groups.

Methods: Matching English and Cantonese individuals with TD with resampling
Two sets of classifications were performed for each machine-learning model, using features relevant to intonation and rhythm, respectively.
In each set of classifications, a total of 5001 iterations of SVM classification was performed according to the nested 10-fold CV procedures described in the ML analysis pipeline, but this time to classify the two languages. Before each iteration, an undersampling procedure was performed on the English TD group to randomly select 24 samples (out of the overall 33), so as to produce a balanced dataset matching the smaller number of samples from the Cantonese TD group (N = 24). The feature array in each iteration of SVM classification therefore consisted of input features from 48 subjects (see the sketch below). The permutation procedure with 5001 iterations was also performed on the resampled feature array in each iteration of resampling. The p-value of the classification was taken as the percentage of AUC values from the permuted model that were equal to or higher than the median AUC of the actual classification.
To adjust for multiple comparisons across the two sets of classifications (on intonation and rhythm features, respectively), each p-value was adjusted using a Bonferroni correction.
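A minimal MATLAB sketch of one iteration of this language classification; idxEnglishTD, idxCantoneseTD, language, and runCV are hypothetical names, with runCV as in the permutation sketch above:

engTD  = randsample(idxEnglishTD, 24);        % undersample 24 of the 33 English TD
subset = [engTD; idxCantoneseTD];             % balanced set: 24 + 24 subjects
auc = runCV(X(subset, :), language(subset));  % classify English vs. Cantonese
% repeated over 5001 iterations; significance assessed with the permutation
% procedure above and Bonferroni-corrected across the two feature sets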